Arabic Dialect Identification Using a Parallel Multidialectal Corpus

Authors

  • Shervin Malmasi
  • Eshrag Refaee
  • Mark Dras
Abstract

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA), the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization, a method not previously applied to this task. The first experiment is a 6-way multi-dialect classification task, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94%, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.

Keywords: Arabic Dialects; Automatic Dialect Identification; Parallel Corpus; Text Classification
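As a rough illustration of the surface features the abstract mentions, the sketch below counts character and word n-grams for a sentence and computes the accuracy of uniform random guessing. The function names (`char_ngrams`, `word_ngrams`, `random_baseline`) and the whitespace tokenization are assumptions made for this example; the abstract does not specify the authors' feature extraction, n-gram orders, or classifier setup.

```python
from collections import Counter


def char_ngrams(sentence, n):
    """Count overlapping character n-grams (spaces kept, so word
    boundaries are visible to the feature). Hypothetical helper."""
    return Counter(sentence[i:i + n] for i in range(len(sentence) - n + 1))


def word_ngrams(sentence, n):
    """Count overlapping word n-grams after whitespace tokenization
    (an assumption; the paper's tokenizer is not given in the abstract)."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def random_baseline(num_classes):
    """Expected accuracy of guessing uniformly among num_classes labels."""
    return 1.0 / num_classes


# Placeholder Latin-script text; a real run would use Arabic sentences.
features = char_ngrams("abab", 2)
baseline_percent = round(random_baseline(6) * 100, 1)
```

With six dialect classes, uniform guessing gives one in six, which rounds to the 16.7% baseline quoted in the abstract. Count dictionaries like these would typically be turned into sparse vectors before being passed to a linear classifier.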

Similar resources

A Multidialectal Parallel Corpus of Arabic

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. Howev...

COLING • ACL 2006 TAG+8

This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data, by leveraging off a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the...

Domain and Dialect Adaptation for Machine Translation into Egyptian Arabic

In this paper, we present a statistical machine translation system for English to Dialectal Arabic (DA), using Modern Standard Arabic (MSA) as a pivot. We create a core system to translate from English to MSA using a large bilingual parallel corpus. Then, we design two separate pathways for translation from MSA into DA: a two-step domain and dialect adaptation system and a one-step simultaneous...

Multidialectal Spanish acoustic modeling for speech recognition

In recent years, language resources for speech recognition have been collected for many languages and, specifically, for global languages. One of the characteristics of global languages is their wide geographical dispersion and, consequently, their wide phonetic, lexical, and semantic dialectal variability. Even if the collected data is huge, it is difficult to represent dialectal variants...

Monodialectal and multidialectal infants' representation of familiar words.

Monolingual infants are typically studied as a homogenous group and compared to bilingual infants. This study looks further into two subgroups of monolingual infants, monodialectal and multidialectal, to identify the effects of dialect-related variation on the phonological representation of words. Using an Intermodal Preferential Looking task, the detection of mispronunciations in familiar word...

Publication date: 2015